Autumn 2021 - Social Graphs and Interactions (02805) - DTU

Import

Harry Potter and the Graph of Secrets - Final Project

Table of Contents

1. Motivation

Choosing the main topic for our final project was not easy, as the internet is full of fandoms and there are millions of pages we could have turned to. But all three members of this wonderful team felt the same when we thought of Harry Potter. We reminisce about those fantasy films we grew up with and dreamt about so many times. All of us agreed that Harry Potter had influenced our childhood and adolescence in a way few other book and film sagas have.

This project makes it clear that motivation towards a task completely changes the perspective with which the task is approached. In our case, from the first day we decided to tackle this topic, we felt the magic of the Harry Potter world welling up inside us, turning us into frenzied fans of the magical world at the mercy of networks and graphs. The opportunity we had before us to venture into the secrets of the connections between characters, between the houses of Hogwarts, and even the development of the feelings of the most important characters throughout the saga... simply thrilling!

Our most arduous task is to lead the audience, namely you, along this path in which enormous amounts of information are revealed, while keeping the mystery and excitement of further reading. We have tried to make this adventurous journey through data analysis, which combines the world of knowledge with the magic of J. K. Rowling's pen, as bearable as possible. So, without further ado, welcome to Harry Potter and the Graph of Secrets.

2. Basic Statistics

The data used and needed for the analysis has been extracted and prepared for this project, as detailed below:

2.1 Data Preparation

First of all, we started with a study of the Harry Potter wiki page, looked at how we could extract information and brainstormed possible ideas for the scope of our final project. We decided to begin by extracting information on as many characters as possible and, to commence with, categorising them by their Hogwarts houses:

  • Gryffindor
  • Ravenclaw
  • Hufflepuff
  • Slytherin
  • Unknown_House

Eventually, we added a fifth category called "Unknown_House", as some of the characters are not members of any house. The list of houses is then created so we can iterate over each house and extract its list of members.


Once the information has been extracted, a dataframe is created linking each of the characters to their respective houses.
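As a minimal sketch (with a handful of made-up member lists standing in for the extracted data), building such a DataFrame might look like:

```python
import pandas as pd

# Hypothetical per-house member lists as they might come out of the extraction
house_members = {
    "Gryffindor": ["Harry Potter", "Hermione Granger"],
    "Slytherin": ["Draco Malfoy"],
    "Unknown_House": ["Dobby"],
}

# One row per character, linking each name to its house
df = pd.DataFrame(
    [(name, house) for house, members in house_members.items() for name in members],
    columns=["name", "house"],
)
```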

In order to proceed with the analysis, a file containing all the useful information extracted from the Harry Potter wiki is made. Then, all the files are stored in the folder Description for later use in the project.

Similarly to the previous step, a new folder is created, named Description_extract, where a file for each of the characters is created and stored, each containing the descriptive text for that character. In order to achieve that and get a "clean" version of each character description, we had to carefully analyse, detect and eliminate each of the unwanted patterns. When dealing with this task we encountered a huge number of different patterns we had to look for and carefully remove in order to leave the relevant information intact. These are some examples:

...and much more.

As can be seen, there was much to take into consideration when cleaning the text: not only the different ways the text was formatted, but also unicode and special characters. The final version of the main patterns used can be seen below:

pattern_1 = r'(=+[\w\s\–\(\)\'^\/\"\,\.\-]+=+)'
pattern_2 = r'(=+(?:[\w\s]+=+\s)+[\w\s]+=+)'
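As a quick sketch of how such a pattern strips wiki headings (using a made-up line of description text):

```python
import re

# Removes wiki-style headings such as ==Early life== from the raw description
pattern_1 = r'(=+[\w\s\–\(\)\'^\/\"\,\.\-]+=+)'

raw = "Harry James Potter ==Early life== was born on 31 July 1980."
clean = re.sub(pattern_1, '', raw)
# The heading markers and their title are removed, leaving the prose intact
```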

In order to clean out organisational characters that leaked through and only keep relevant characters, we clean our DataFrame as follows:

The function find_cat is created in order to detect the category in an easy and effective way.
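Since find_cat itself is not shown here, a minimal hypothetical sketch of what it might look like (assuming it scans raw wiki markup for [[Category:...]] tags):

```python
import re

def find_cat(text):
    """Hypothetical sketch: return all wiki category names found in raw page text."""
    return re.findall(r'\[\[Category:([^\]|]+)', text)

# e.g. on a fragment of raw wiki markup:
cats = find_cat("...[[Category:Gryffindors]] [[Category:Quidditch players]]...")
```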

From each of the files for the Harry Potter characters that we have saved in the folder Description, we still have plenty of information to gather, like gender, blood status (whether the character descends from a Muggle family or not), nationality, height, weight, and much more. To do that, the function get_all_attributs is created below.

As an example, here are the attributes we obtain from the Ronald Weasley file:

Finally, we use the get_all_attributs function to, in fact, get all the attributes of the characters and add them as new columns of information in the dataframe, so we are able to sort and use the relevant information we filtered.
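The actual implementation of get_all_attributs is not reproduced here; a minimal hypothetical sketch, assuming infobox-style "|key = value" lines in the raw description, might look like:

```python
import re

def get_all_attributs(description):
    """Hypothetical sketch: pull '|key = value' infobox-style pairs out of a
    character's raw description text."""
    attrs = {}
    for key, value in re.findall(r'\|\s*(\w+)\s*=\s*([^\n|]+)', description):
        attrs[key.lower()] = value.strip()
    return attrs

raw = "|gender = Male\n|blood = Pure-blood\n|nationality = English\n"
attrs = get_all_attributs(raw)
```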

As the process we just went through is quite tedious and especially time-consuming, we store the final dataset so that we do not have to re-run the code each time we open the notebook. This way, we just create the file pickle_network_prep.txt, store it, and simply load it when we come back to the code!
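A minimal sketch of the store-and-reload step with the standard pickle module (using a tiny stand-in dict for the prepared dataset):

```python
import pickle

# Hypothetical stand-in for the prepared dataset
prepared = {"Harry Potter": {"house": "Gryffindor"}}

# Store once...
with open("pickle_network_prep.txt", "wb") as f:
    pickle.dump(prepared, f)

# ...and simply reload on later runs instead of re-scraping the wiki
with open("pickle_network_prep.txt", "rb") as f:
    reloaded = pickle.load(f)
```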

It is time to generate the first graph of this network. Each node will have several features, all related to the character it corresponds to.

Now that we have generated the graph G, we want to remove those nodes with a degree equal to zero as they are not relevant for our analysis (a total of 315 nodes). After removing those nodes from our graph, we proceed to remove the characters as well from our Dataframe. Therefore, we are left with just the characters that matter for our study.

In this part, as we did during the exercises in class, we decided to consider only the GCC of the graph G. We also reduce the DataFrame to contain only the names of the characters in the GCC.
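The pruning steps above (dropping degree-zero nodes, then keeping only the giant component) can be sketched with networkx on a toy directed graph:

```python
import networkx as nx

# Toy directed graph standing in for the character network
G = nx.DiGraph()
G.add_edges_from([("Harry", "Ron"), ("Ron", "Hermione"), ("Draco", "Crabbe")])
G.add_node("Loner")  # a degree-zero node

# Drop nodes with degree equal to zero
G.remove_nodes_from(list(nx.isolates(G)))

# Keep only the giant (weakly) connected component
gcc_nodes = max(nx.weakly_connected_components(G), key=len)
GCC = G.subgraph(gcc_nodes).copy()
```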

2.2 Data and Basic Statistics

We start by looking at the size of the network we are dealing with:

The most connected characters are detailed below, from an in-, out- and total-degree perspective. The names shown were expected, as they are some of the most famous characters.

Intriguing, but who are really the characters with most connections from each of the 4 houses of Hogwarts? We find the answer here:

Now that we have had an overview of the characters and their degrees, we plot the degree distribution of the graph to get a better grasp of the network at hand.

For a random network, the standard deviation of the degree distribution should be close to the square root of the average degree ⟨k⟩. Since we observe a much larger spread, and the total-degree exponent of the power-law fit is very close to 3, we conclude that the network follows a scale-free regime. Reference
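The comparison above can be sketched on a toy degree sequence:

```python
import math
import statistics

# Toy degree sequence standing in for the network's degrees
degrees = [1, 1, 1, 2, 2, 3, 5, 8, 21, 55]

k_mean = statistics.mean(degrees)
k_std = statistics.pstdev(degrees)

# For a Poissonian random network we would expect k_std ≈ sqrt(k_mean);
# a much larger spread hints at a heavy-tailed (scale-free) distribution
print(k_std, math.sqrt(k_mean))
```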

We now proceed to check the "80/20 Rule". As Vilfredo Pareto stated after a study carried out in the 19th century, "roughly 80 percent of money is earned by only 20 percent of the population." Ref. Following the Network Science book by A.-L. Barabási, "The 80/20 rule is present in networks as well: 80% of links on the Web point to only 15% of webpages; 80% of citations go to only 38% of scientists; 80% of links in Hollywood are connected to 30% of actors. Most quantities following a power law distribution obey the 80/20 rule."

Let's find out how this 80/20 rule behaves in our Network!

In the table above we can see that 80% of the links of the Harry Potter wiki network are accounted for by the characters down to "Cadogan", who together represent 26.6% of all the characters in the network. We now proceed to neatly plot this finding:

Finally, we conclude that our network follows the 80/20 rule, as in our case 80% of the links of the network are covered by 26.6% of the characters. Ref.
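A minimal sketch of the 80/20 computation on a toy degree sequence:

```python
# Toy degree sequence, sorted from most to least connected
degrees = sorted([55, 21, 8, 5, 3, 2, 2, 1, 1, 1], reverse=True)

total_links = sum(degrees)
cumulative, n_top = 0, 0
for d in degrees:
    cumulative += d
    n_top += 1
    if cumulative >= 0.8 * total_links:
        break

# The smallest top fraction of characters that accounts for 80% of the links
top_fraction = n_top / len(degrees)
```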

2.3 Visualization

3. Tools, theory and analysis

3.1 Project 1: WordCloud analysis of houses

3.1.1 Motivation

This part is about analyzing the different houses of Hogwarts with the use of WordClouds, i.e. we will generate images consisting of words, with the size of the words in the image proportional to the number of times that word appears in the string used to generate the image. Firstly, we want to analyze how the members of the different houses speak and see if there are any patterns/trends there. Secondly, we want to use the characters' wiki-descriptions to look for any trends/differences between the houses.

3.1.2 Tool

Part a: WordClouds based on character dialogue

For the analysis of the way members of the different houses speak, we will use the movie scripts data. In order to have useful data for this part, we will need to load and prepare the movie data, which can be done in the following steps:

Once the data has been prepared we can start analyzing it using the wordcloud library. One way to do this is to simply combine all tokens of characters belonging to one house, convert them into one string, and then use that string to generate the wordcloud. While this should provide a decent representation, there is a better way to do it: using the TF-IDF score of the words. The TF-IDF score of a word is the product of its term frequency and its inverse document frequency, where the term frequency is the number of times that word appears in a document (in our case a character's complete dialogue) and the inverse document frequency is the log of the total number of documents divided by the number of documents in which the word appears. Using TF-IDF ensures that words appearing in few documents get greater importance, meaning the more unique words said by characters are weighted more, hence the differences should be more noticeable using this method.
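A minimal, self-contained sketch of this weighting, with made-up per-character "documents":

```python
import math
from collections import Counter

# Each "document" is one character's tokenized dialogue (made up here)
docs = {
    "harry": ["wand", "voldemort", "wand", "hogwarts"],
    "ron": ["chess", "spiders", "hogwarts"],
    "hermione": ["books", "library", "hogwarts"],
}

n_docs = len(docs)

def tf_idf(tokens):
    """Term frequency times log(N / document frequency) for each word."""
    counts = Counter(tokens)
    weights = {}
    for word, tf in counts.items():
        df = sum(1 for d in docs.values() if word in d)
        weights[word] = tf * math.log(n_docs / df)
    return weights

harry_weights = tf_idf(docs["harry"])
# "hogwarts" appears in every document, so its weight collapses to zero,
# while character-specific words like "voldemort" keep a positive weight
```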

We will use both methods and compare the results. We will first create the WordClouds simply based on the characters' pure dialogue for each house where the steps are as follows:

We will then create the WordClouds based on the words' TF-IDF scores, where the steps are as follows:

Part b: WordClouds based on character wiki-pages

For the analysis of the characters' wiki-pages, we will use the DataFrame created from the Harry Potter wiki pages. We will again need to prepare this data; however, as most of this has already been done, this part is fairly short and just about loading the right data. The steps are as follows:

Now that our data has been prepared, we will again create WordClouds based on both the pure character descriptions and the TF-IDF scores of the words. This process is identical to steps 7 - 13 in part a.

3.1.3. Applying the tool

Part a: WordClouds based on character dialogue

In this section we will follow the steps described in the previous section.

Part b: WordClouds based on character wiki-pages

We will again follow the steps from the previous section

3.1.4 Discuss the outcome

To summarize, TF-IDF assigns a value to a term according to its importance in a document, scaled by its importance across all documents in the corpus; this mathematically suppresses words that are common in the English language and throughout the corpus, and selects words that are more descriptive of the text at hand. NLP tasks such as text summarization, information retrieval, and sentiment classification therefore utilize TF-IDF for its powerful weighting operation. <!-- * Wordcloud TC-IDF result analysis:

Gryffindor:

  • Voldemort
  • quirrel, flitwick, sprout * Ravenclaw: The words "School", "year" and "Hogwarts" have been erased in the Word clouds made with TC-IDF. \ Slytherin: The words "first", "family" and "Hogwarts" have been erased in the Word clouds made with TC-IDF. \ Hufflepuff: The words "School", "newt" and "Hogwarts" have been erased in the Word clouds made with TC-IDF. \ Sprout: Head of Hufflepuff -->

As the TF-IDF WordClouds are more descriptive, we will use these for analyzing both the script and the wiki descriptions.

3.2 Project 2: Communities

i. Points of interest

In this section we will group all the characters together into communities using the Louvain algorithm. We will do this in order to see if any interesting partitions occur, e.g. will all the professors end up in a group together? Will the "good" characters be grouped together and the "evil" characters together?

ii. Explain the tool

This section contains two parts, first the creation of the communities using the Louvain algorithm, and secondly some analysis of the communities.

part a: Identifying the communities

The Louvain algorithm belongs to the class of modularity-maximization algorithms for community detection: it tries to optimize the overall modularity of the partition it finds. In terms of its functioning, it works by repeating two steps until the communities found in an iteration have no better modularity than those of the previous iteration. These 2 steps are:

  1. Local moving: each node is moved to the neighboring community that yields the largest gain in modularity (or left where it is if no move gives a positive gain).
  2. Aggregation: each community found in step 1 is collapsed into a single node, and step 1 is re-run on this smaller, aggregated graph.

The algorithm ends when re-doing step 1 yields no further positive gain in the modularity of the network.

We are using the Louvain algorithm in order to identify our communities. It is worth noting that the Louvain algorithm is nondeterministic and dependent on the order in which the nodes are processed. That is why we have used a random seed, so that we can comment on the results without having to worry about them changing if the notebook is run again.
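A sketch of this step with networkx's built-in Louvain implementation (shown on the Zachary karate-club toy graph rather than our character network, with a fixed seed for reproducibility):

```python
import networkx as nx

# Toy graph standing in for the character network
G = nx.karate_club_graph()

# Louvain is nondeterministic, so we fix the seed to get reproducible communities
communities = nx.community.louvain_communities(G, seed=42)

# Modularity of the found partition (closer to 1 means a stronger division)
Q = nx.community.modularity(G, communities)
```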

part b: Analysing the communities

Once we have identified the communities using the Louvain algorithm, we will use several techniques to identify what trends each community has.

iii. Applying the tool

part a: Identifying the communities

We will now apply the tools described in ii to create the communities.

Modularity is a measure of the structure of networks or graphs, i.e. it measures the strength of the division of a network into communities. The closer to 1, the better the division. As we can see, the Louvain modularity is much higher than the house modularity or the gender modularity, even if a modularity of 0.28 is not remarkably high either.

part b: Analysing the communities

We will now analyse the identified communities, going through the steps 1-4 described in ii.

iv. Discussing the results

We will now have a look at the results and see if we can conclude anything interesting about these communities. Of course we can instantly see from the size distribution plot that community 1 is the largest community by far. Furthermore, when we look at the 10 characters with the highest degree of each community, we also see that this community 1 contains most of the main characters, including Harry, Ron and Hermione. Looking at the other communities' characters with the highest degrees, we can see some interesting patterns form:

We can see how the Louvain algorithm has managed to group the characters together, often managing to separate characters of the movies from characters from the games. The algorithm also managed to put the old Gryffindor Quidditch captains into one group, and other past Quidditch players into another group. The size of the communities varied a lot however, with community 1 having almost 200 members while several other communities didn't even have 10. This meant that some of the communities would be hard to analyze using WordClouds and sentiment analysis, since there wouldn't be enough data for proper analysis.

For the WordClouds based on the wiki descriptions, we decided to only take those communities that had at least 10 members, so that the WordClouds would not be based on only two or three people. This leaves us with 8 communities to calculate the WordClouds for. We can see in the WordCloud for community 2 that this community is indeed all about Quidditch, as some of the most important words are "quidditch", "appears" and "team", and the fact that "Gryffindor" and "captain" are also very important words shows that it is indeed past Gryffindor Quidditch captains. Communities 1 and 4 have several similarities, with words such as "Voldemort", "death", "family", "battle" and "friend" appearing, with high importance, in both WordClouds. As these two communities are both large and seem to mainly consist of characters from the movies, it makes sense that a lot of the words are similar. However, as we found that community 4 had many members of the Slytherin house, one might have expected the WordCloud for community 4 to be more "evil" than community 1's, but that doesn't seem to be the case. In communities 8 and 9, we can again see that many of their members were involved with Quidditch, especially from the Slytherin house, as words such as "quidditch", "slytherin" and "captain" are very large in these WordClouds. The WordCloud for community 10 also shows that this community does indeed consist of ghosts, as words such as "ghost" and "spirit" are very large, and the names of the ghosts also appear often, e.g. "headless", "nick", "helena", "ravenclaw", etc.

When doing the sentiment analysis of the communities, we could only perform this on the communities which contained characters who spoke in the movies, hence only communities 1, 4, 5, 10 and 13 could be analyzed. Furthermore, as the characters from communities 5 and 10 don't have many lines in the movies, not a lot can be concluded from them. However, out of the remaining communities, communities 1, 4 and 13, it can be seen that it is actually community 1 which has the lowest sentiment through the movies. This could be because it is often the main characters who experience suffering and grief in the movies.

3.3 Project 3: Sentiment Analysis

3.3.1 Motivation

In this section we will look at the characters' sentiment throughout the movies. We will look at the main characters' individual sentiment change throughout the movies, as well as the overall sentiment of each movie and the individual houses' sentiment change throughout the movies. We will be looking at the sentiment of each chapter, and will thus be able to get an insight into the movies.

3.3.2. Tool

First of all, as we will be working with the movie script dataset, we need to first prepare the data. We do that the same way as in 3.1 Project 1:WordCloud analysis of houses - steps 1 to 5 where we fix the names, add house data, clean the text and tokenize the text. We are going to be using both the VaderSentiment and LabMT sentiment analysers to analyse the sentiment of the script.
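A minimal sketch of the LabMT-style scoring (with a tiny made-up happiness lexicon; the real LabMT list contains roughly 10,000 words rated from 1 to 9):

```python
# Tiny made-up stand-in for the LabMT happiness lexicon (real scores run 1-9)
labmt = {"love": 8.4, "happy": 8.3, "death": 1.5, "kill": 1.8, "wand": 5.0}

def labmt_sentiment(tokens):
    """Average happiness of the tokens that appear in the lexicon."""
    scores = [labmt[t] for t in tokens if t in labmt]
    return sum(scores) / len(scores) if scores else None

line = ["i", "love", "my", "wand"]
score = labmt_sentiment(line)  # average of 8.4 and 5.0
```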

3.3.3. Applying the tool

part a: Sentiment of characters throughout the movies
part b: Sentiment analysis of each house throughout the chapters

3.3.4. Discuss the outcome

As the sentiments are calculated per chapter, there are several characters whose sentiment graphs aren't complete, as they don't appear in every chapter. When we look at the main character, Harry Potter, we can see that the Vader sentiment and LabMT sentiment generally tend to follow each other. Sometimes one will have a more extreme peak or dip than the other; for example, the LabMT sentiment has a much more dramatic drop than the Vader sentiment at the end of movie 1, whereas it's the other way round for a scene near the start of movie 2. The sentiment of Harry Potter seems to remain fairly similar throughout all the movies for both models. When we look at the other two main characters' sentiment compared to Harry Potter's, we can see that they follow each other quite nicely, with a few exceptions: Ron, for example, has a large dip in the first movie where neither of the others has one, and both Ron and Hermione have several drops in movie 7 and the start of movie 8 where Harry doesn't. Looking at some of the other characters, we see that Dumbledore's sentiment is almost always positive/above average, whereas Voldemort's is the opposite, as would be expected.

We will now look at the overall sentiment change of the movies as well as the sentiment of each house. First we notice that although the sentiment varies for each chapter, the average sentiment of each movie seems to be about the same. We notice a very large dip in sentiment at the start of movie 2. This is from the scene "Writing on the wall", where Harry hears the snake in the wall talking about killing and murdering, and Argus Filch says he wants to kill Harry when he sees his cat is dead and thinks Harry was the one who did it. This explains why the sentiment is so low for this scene. The Gryffindor sentiment follows the overall sentiment almost perfectly, which is likely because many of the main characters are from Gryffindor, hence they make up a large part of the script. There is however a large drop in sentiment in the LabMT model towards the end of movie 1. For the Ravenclaw and Hufflepuff houses, the same problem as with the characters occurs; there are a lot of chapters where they don't talk, so it is hard to conclude much from their sentiment. For the Slytherin house, we can see that their sentiment is often lower than that of Gryffindor. At the end of movie 8 we see Slytherin has a very high sentiment quickly followed by a very low sentiment, suggesting something changes for the worse for the Slytherins at the end, which is indeed in line with the actual movie.

3.4 Project 4: Similarity between characters

3.4.1 Motivation

This graph made from the Fandom wiki has some underlying knowledge that can be extracted from its structure. The Louvain community algorithm helped us to understand some of the structure by clustering individuals into groups, also known as communities. However, this does not give us information at the node level.

For instance, what is the most similar/closest character to Harry Potter? With the Louvain community algorithm, we can only say that Harry Potter belongs to a certain community, but nothing more. Can we do better?

3.4.2 Tool

Prerequisite: the skip-gram model. If you are familiar with the word2vec skip-gram model, great; if not, I recommend this great post (http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/), which explains it in great detail, as from this point forward I assume you are familiar with it.

The skip-gram neural network model is actually surprisingly simple in its most basic form: we train a simple neural network with a single hidden layer to perform a certain task, but then we're not actually going to use that neural network for the task we trained it on! Instead, the goal is just to learn the weights of the hidden layer; we'll see that these weights are actually the "word vectors" that we're trying to learn.


Node2vec is an algorithm to generate vector representations of nodes on a graph. node2vec follows the intuition that random walks through a graph can be treated like sentences in a corpus. Each node in a graph is treated like an individual word, and a random walk is treated as a sentence.

By feeding these "sentences" into a skip-gram or continuous-bag-of-words model, the paths found by random walks can be treated as sentences, and traditional data-mining techniques for documents can be used.

The algorithm generalizes prior work which is based on rigid notions of network neighborhoods, arguing that the added flexibility in exploring neighborhoods is the key to learning richer representations of nodes in graphs. The implementation builds on Gensim's word2vec, and node2vec is a popular approach for learning node representations in a graph.

The algorithm follows 2 steps: generation of chains of nodes with a sampling strategy, and training of a word2vec model such as the skip-gram model.

Node to vector

High level process:

  1. From a graph, start from a node and generate a defined number of walks through the network of a certain length, which I will call "sentences". For instance: (Node 1 -> Node 5 -> Node 8 -> Node 6)
  2. Each of the sentences is used to train a word2vec model, and the output attaches a unique vector to each unique node.
  3. Each vector can be compared with the other vectors using the cosine similarity.
  4. The most similar vectors can be traced back to their nodes.

For more information, I recommend you to have a look at this creative post: https://towardsdatascience.com/node2vec-embeddings-for-graph-data-32a866340fef

Example: What is the node most similar to node 1?

Step 1: Sentences generation

  • (Node 1 -> Node 5 -> Node 8 -> Node 6)
  • (Node 4 -> Node 2 -> Node 10 -> Node 14)
  • ...
  • (Node 3 -> Node 54 -> Node 8 -> Node 25)

Step 2: Training of a word2vec model and generation of a unique vector for each node

  • Node 1: vec_node_1=[0.1,0.53,0.958]
  • Node 2: vec_node_2=[-0.7,0.1,-0.4]
  • ...
  • Node 76: vec_node_76=[-0.1,-0.5,0.6]

Step 3: Compute the cosine similarity between each pair of node vectors:

  • (Node 1, Node 2) = cosine_similarity(vec_node_1, vec_node_2) = 0.6
  • ...
  • (Node 1, Node 76) = cosine_similarity(vec_node_1, vec_node_76) = 0.3

Step 4: Sort the cosine similarities of node 1 with all the nodes and return the most similar one

  • Most_similar(Node 1) = Node 5
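The cosine-similarity computation from step 3 can be reproduced with a small helper (using the toy vectors from step 2; note that these particular toy vectors happen to point in roughly opposite directions, so the similarity comes out negative):

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

vec_node_1 = [0.1, 0.53, 0.958]
vec_node_2 = [-0.7, 0.1, -0.4]

sim = cosine_similarity(vec_node_1, vec_node_2)
```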

3.4.3 Apply the tool

Step 0: Import + graph with weight + useful functions

Step 1: Sentence generation. In the following cell, I will pick the 10 most-linked nodes based on total degree and generate 200 walks of length 25 from each. Therefore, a total of 2000 walks of length 25 will be generated.
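A minimal sketch of the walk generation (uniform random walks on a toy graph; node2vec proper biases its walks with the p and q parameters, which are omitted here for brevity):

```python
import random
import networkx as nx

random.seed(42)

def generate_walks(G, start, n_walks, length):
    """Generate uniform random walks of a fixed length from a start node."""
    walks = []
    for _ in range(n_walks):
        walk = [start]
        while len(walk) < length:
            neighbors = list(G.neighbors(walk[-1]))
            if not neighbors:
                break
            walk.append(random.choice(neighbors))
        walks.append(walk)
    return walks

# Toy graph standing in for the character network
G = nx.karate_club_graph()
walks = generate_walks(G, start=0, n_walks=200, length=25)
# these "sentences" of node ids are what gets fed to the word2vec model
```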

Step 2: Training of a word2vec model and generation of a unique vector for each node

3.4.4 Discuss the outcome

Finding 1: Who are Harry Potter's closest characters according to the graph?

Step 3: Compute the cosine similarity between each pair of node vectors. Step 4: Sort the cosine similarities of the node "Harry Potter" with all the nodes and return the most similar ones.

  • Hermione Granger and Ronald Weasley: Together with Harry, they form the Trio of the saga.
  • Tom Riddle: Also known as Voldemort, who is Harry Potter's supreme enemy.
  • Draco Malfoy: He is Harry Potter's rival.
  • Ginevra Weasley: She is Harry Potter's wife.
  • Dolores Umbridge and Severus Snape: Two professors (Head of the Improper Use of Magic Office and Co-Leader of the Duelling Club at Hogwarts) who teach Harry Potter at Hogwarts.
  • Rubeus Hagrid: Hagrid is Harry's friend, and soon he is also friends with Ron and Hermione. He invites them all to tea at his house on their days off, and he tells them the secret of his baby dragon.
  • Neville Longbottom: They were good friends as well as classmates and roommates at Hogwarts, and Neville supported Harry against Voldemort by joining Dumbledore's Army and destroying the last Horcrux, Nagini, which made Voldemort mortal again. Also, both James and Lily (Harry's parents) and Frank and Alice (Neville's parents) were outspoken opponents of Voldemort and notable members of the Order of the Phoenix.
  • Sirius Black: Sirius is Harry's godfather and sometimes treats him like a son, but at other times he seems to forget that Harry isn't his best friend, James Potter.
  • Albus Dumbledore: Dumbledore is the man from whom Harry has learned most about being a wizard and a human being. Throughout their six years together, Dumbledore spends a great deal of his time teaching Harry about life in the way most parents do.

Finding 2: Who is Harry Potter's lover, according to the graph?

Since all the nodes are now encoded into vectors, we can do some mathematical operations with them!

For instance: King-man+Woman = Queen

Since we only have the names of the characters, I want to play with their hearts. We know that Ronald Weasley and Hermione Granger are together, but suppose I can't find the name of Harry Potter's wife. To find it, let's compute: Harry Potter + (Love) = Harry Potter's lover, i.e. Harry Potter + (Ronald Weasley - Hermione Granger) = Harry Potter's lover.
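With made-up 2-d toy vectors, chosen so that the "partner" offset is roughly constant, the analogy arithmetic can be sketched as follows (using euclidean distance for the nearest-vector lookup instead of cosine similarity, for brevity):

```python
# Made-up toy embeddings where partners differ by a roughly constant offset
vectors = {
    "hermione": [0.9, 0.1],
    "ron": [0.9, 0.8],    # hermione + the "partner" offset
    "harry": [0.2, 0.1],
    "ginny": [0.2, 0.8],  # harry + the same offset
    "draco": [-0.5, -0.5],
}

def add(u, v): return [a + b for a, b in zip(u, v)]
def sub(u, v): return [a - b for a, b in zip(u, v)]

def most_similar(target, exclude):
    """Nearest stored vector (squared euclidean distance) not in `exclude`."""
    def dist(name):
        return sum((a - b) ** 2 for a, b in zip(vectors[name], target))
    return min((n for n in vectors if n not in exclude), key=dist)

# harry + (ron - hermione) should land near harry's partner
target = add(vectors["harry"], sub(vectors["ron"], vectors["hermione"]))
lover = most_similar(target, exclude={"harry", "ron", "hermione"})
```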

Wonderful! Thank you, Node to Vector, for refreshing my memories!


Finding 3: Can we understand the Louvain communities with the Node to Vector algorithm?

At first glance, the t-SNE projection differentiates the Louvain communities better than the houses. This is consistent with the higher modularity found for the Louvain partition than for the house partition.

To sum up, the Louvain algorithm proved that it can pick up communities better than just looking at the houses, and this can be shown by plotting each of the nodes (represented by the vectors created by the Node to Vector algorithm).

3.5 Project 5: Emotion

3.5.1 Motivation

The emotion behind a character is defined by more than just positive or negative sentiment. To get complex emotions such as happiness, hate or worry, we need to train a model to recognize those sentiments on a training set and apply it to Harry Potter movie scripts. This technique is called transfer learning.

3.5.2 Tool

a) Train a model on a training set

For the analysis of the emotion of the characters, we will use a database composed of tweets that have been classified into 13 emotions (Main emotion of the tweet). Link: https://www.kaggle.com/icw123/emotion

b) Apply the model on a HP scripts

3.5.3 Apply the tool

a) Train a model on a training set

Step 1: Loading the packages and creation of the functions that will clean the data.

Step 2: Loading the packages and creation of the functions that will clean the data.

Step 3: Reduction of the number of emotions from 13 to 11 (dropping anger and boredom) due to a lack of examples to train the model.

Step 4: Converting each emotion to an integer

Step 5: Preprocess the tweets to remove useless parts of the strings.

Step 6: Filtering data to only use tweets with more than two words after processing

Step 7: Convert words into vectors of numbers to allow use within models by using TfidfVectorizer with unigrams and bigrams.
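A sketch of this vectorisation step with scikit-learn's TfidfVectorizer, on a few made-up tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for the tweet dataset
tweets = [
    "so happy today",
    "happy birthday to you",
    "worried about tomorrow",
]

# Unigrams and bigrams, TF-IDF weighted
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(tweets)  # sparse matrix: one row per tweet
```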

Step 8: Resample the data so that each emotion has an equal number of tweets.

As the data is very imbalanced (759 enthusiasm tweets vs 8638 neutral) this can negatively affect the training of the model. Therefore I will use SMOTE to resample the data so that each emotion has an equal number of tweets. Regular resampling can result in overfitting, as the model will be trained on identical rows of data many times, whereas SMOTE involves the creation of new, synthetic data which is based on the existing data, so it will be similar but not identical. This helps to avoid overfitting.
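SMOTE's core interpolation idea can be sketched in a few lines (an illustration of the principle only, not imbalanced-learn's implementation, which interpolates towards one of the k nearest minority neighbours):

```python
import random

random.seed(0)

def smote_sample(x, neighbor):
    """Create one synthetic sample on the segment between a minority point
    and one of its neighbours."""
    gap = random.random()  # random position along the segment
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]

x = [1.0, 2.0]
neighbor = [3.0, 4.0]
synthetic = smote_sample(x, neighbor)
# the synthetic point lies between the two originals, not on top of either
```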

Step 9: Train our favorite ML algorithm on the tweets

b) Apply the model on a HP scripts

Step 10: Preprocess the Harry Potter's scripts

Step 11: Apply our favorite ML algorithm on Harry Potter's scripts

Step 12: Converting each emotion integer into emotion string

Step 13: Addition of columns for future plotting

c) Plotting the different outputs

3.5.4 Discuss the outcome

The plot above shows the impact of each emotion or feeling in each scene of the entire Harry Potter saga. The weight of the feelings in each scene has been calculated by considering the following emotions: happiness, love, surprise, fun, enthusiasm and relief are all rather good emotions, whereas worry, sadness and hate are bad. Finally, the neutral and empty categories group all the remaining emotions obtained from the model.

Below we take a closer look at the analysis of the results of this fantastic tool we have developed to study the unfolding of sentiment in films.

Happiness

  • When looking at the impact happiness has in the movies we can clearly see 3 spikes. The first two we see make a lot of sense since they are located towards the ending of movies 1 and 2, which are well known for having very happy endings in comparison with the rest of the saga.
  • Another spike can be seen in Harry Potter and the Deathly Hallows part 1, at the moment when Hermione and Harry discover that they can use Gryffindor's sword, which was left for Harry by Dumbledore, to destroy Horcruxes.
  • Just by looking at the plot it can be seen that happiness is in decline as the saga progresses, and in fact, it is widely acknowledged by fans that the films get darker and darker, going from a compilation of children's films to a totally adult mood.

Sadness

  • The most highlighted moment of sadness throughout all of the movies can be seen towards the end of the last movie, when Harry Potter dies (coming back to life afterwards) and Hogwarts is partially destroyed when fighting against the dark lord Voldemort.
  • A very relevant part considering sadness can be noted at the end of Harry Potter and the Order of the Phoenix, when Bellatrix Lestrange strikes Sirius Black with the Killing Curse and kills him in front of Harry. A very sad moment.
  • It can also be seen at the beginning of the chamber of secrets a spike, related to the visit Dobby pays Harry to warn him about going to Hogwarts.

Love

  • As expected, we can see love highlights at the beginning and ending of the first and second movies, and at the end of the saga when Harry kills Voldemort, Hogwarts is saved and there is a fast-forward where we can see Harry and Ginny Weasley married and with a son.
  • The highest spikes in love are shown in movies 4 and 7. Looking into the Goblet of Fire, those scenes are related to the Quidditch World Cup, where there is a lot of fun and partying, and all the Weasleys, Hermione and Harry are together. In the Deathly Hallows part 1, at the beginning of the movie, the Dursleys leave Privet Drive, leaving Harry behind, and Hermione uses "Obliviate" on her parents to erase their memories, making them think they never had a daughter. Both are very loving, emotive moments.

Hate

  • In regards to hate, there is a very remarkable scene in the Chamber of Secrets. Harry wanders through the corridors of Hogwarts following the sounds and whispers of the Basilisk. He then encounters Mrs Norris, Argus Filch's cat, petrified and hung from a lamp next to a message written on the wall. Filch is absolutely furious about this and thinks Harry did it to his cat. Also, Hermione claims that the message on the wall is written in blood. There is a lot of hate in this scene overall, mainly driven by Argus Filch's feelings towards Harry, whom he blames for killing his cat. Amazing insight!

Fun

  • According to the results, the most fun scene happens at the end of Harry Potter and the Prisoner of Azkaban. Harry and Hermione save Sirius Black and Buckbeak, and afterwards go back to the hospital wing to visit Ron. He is already awake and asks how Harry and Hermione got outside the wing when, a moment earlier, they were right in front of him (time travel).

Worry

  • This emotion is quite stable, or monotonous, throughout the films. Given the very nature of the films' style, plot and events, a generally "worrying" result was to be expected. In addition, when compared to other feelings such as happiness or sadness, the weight or percentage impact on the overall feeling of each scene is much higher on average for worry.

Relief

  • At the beginning of the Chamber of Secrets, Harry travels to Diagon Alley by Floo Powder for the first time, but, panicking after watching a demonstration of its effect, he mispronounces "Diagon Alley" as "diagonally". He then appears in a rather "dark" part near Diagon Alley, but eventually finds his way back, so the overall feeling of relief makes sense!
  • Towards the end of Harry Potter and the Prisoner of Azkaban, Harry and Hermione save Sirius Black and Buckbeak after traveling back in time. The final part is quite a relief as their efforts to save the situation are fulfilled.

Neutral

  • Although it is very exciting to talk about powerful feelings like happiness, sadness, hate... Harry Potter is characterized by one feeling above all others, the neutral one. As boring as that may sound, most of the sentences are characterized by that feeling as they are neither extremely happy nor sad. Moreover, since it is all dialogue, most of the words used are neutral in character. Therefore, it makes perfect sense to have obtained these results in the plot.
  • However, it is worth noting that the feelings of worry and neutral both have a similar impact in each scene. This helps determine the general nature of the Harry Potter saga.

Overall, neutrality and worry are the two most common emotions in the movies, whereas hate and enthusiasm are the least common. These were also two of the least common emotions in the training set, so these results may not be accurate and may be a consequence of the dataset being imbalanced, despite having used SMOTE.
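To illustrate why SMOTE helps with the imbalance mentioned above: it balances the classes by creating synthetic minority-class samples interpolated between existing ones. The project presumably used the SMOTE implementation from the imbalanced-learn library; the toy re-implementation below (with made-up 2-D points) only sketches the core interpolation idea:

```python
import random

def smote_like_oversample(minority, n_new, seed=0):
    """Create n_new synthetic points interpolated between random pairs
    of minority-class samples (a simplified version of SMOTE's idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

# Toy minority class: three 2-D feature vectors
minority = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
new_points = smote_like_oversample(minority, n_new=5)
```

Real SMOTE additionally restricts interpolation to each sample's k nearest neighbours, which this sketch omits for brevity.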

4. Discussion

Let us summarize this journey through Harry Potter and the Graph of Secrets.

What could have been done in addition to this project?

  • Comparison between movies and Fandom: creation of a network based on the movies' interactions (if two characters appear in the same scene, add a link between them), and comparison of this network with the one based on the Fandom wiki.
  • Comparison between books and Fandom: creation of a network based on the books' interactions, and comparison with the one based on the Fandom wiki.
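The co-occurrence rule proposed above can be sketched in a few lines. The scene lists here are made up for illustration; the real ones would be parsed from the scripts or books:

```python
from itertools import combinations

# Hypothetical scenes, each listing the characters that appear in it
scenes = [
    ["Harry", "Ron", "Hermione"],
    ["Harry", "Dumbledore"],
    ["Ron", "Hermione"],
]

# Add an undirected edge for every pair of characters sharing a scene;
# sorting the pair makes (a, b) and (b, a) the same edge
edges = set()
for characters in scenes:
    for a, b in combinations(sorted(set(characters)), 2):
        edges.add((a, b))
```

The resulting edge set could then be loaded into a `networkx.Graph` for comparison with the Fandom-based network.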

5. Contributions

All team members have contributed equally to both the assignments and the final project, which was then written according to the plan stated below:

6. References

Data

The data used in the analysis was extracted and prepared as part of this project; the sources are listed below:

Book and papers